Omada: robust clustering of transcriptomes through multiple testing

Abstract Background Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High-throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets, but selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this, we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning–based functions. Findings The efficiency of each tool was tested with 7 datasets characterized by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements. Conclusions In conclusion, Omada successfully automates the robust unsupervised clustering of transcriptomic data, making advanced analysis accessible and reliable even for those without extensive machine learning expertise. Implementation of Omada is available at http://bioconductor.org/packages/omada/.


Introduction
The rapid development of next-generation sequencing boosted the quantitative analysis of gene expression in a variety of human tissues and organs 1,2 generating valuable resources 3 for downstream investigative analysis.In recent years, such analyses aim to elucidate disease mechanisms 4 and construct genomic profiles 5 to explain diagnosis 6 , prognosis and treatment patterns.However, transcriptomic profiles can be heterogeneous due to several causes pertaining to technical biases that produce batch effects 7 , cellular diversity 8 , disease heterogeneity 9 as well as differences between individuals and populations 10,11 .In turn, this heterogeneity hinders traditional research efforts 12 aiming to define structures especially under complex diseases which led to the utilisation of the field of machine learning towards this data demanding goal.More specifically, unsupervised machine learning i.e. clustering, in the form of transcriptomic profiling based on sequencing data [13][14][15] , has been explored in terms of symptomatic heterogeneity in complex diseases revealing differences in molecular states [16][17][18] and phenotypes described by the gene expression of diseased tissue.However, deep medical and molecular knowledge is required to identify solvable problems and interpret results within the context of various diseases and conditions.Simultaneously, specialised knowledge and experience is needed to create functional, efficient and insightful models which generate reproducible solutions.Despite the inherent power of these models, most times a default model is not sufficiently tuned to the specific dataset thus unable to extract essential information.Many models have been compared, tested and found to work on different data and research questions 19,20 highlighting that no single model is always optimal without tuning (or optimising) on the specific dataset at hand, especially with state-of-the-art methodologies 21,22 .
Machine learning (ML) is currently being used in many forms and combinations 23 , for different types of projects within diverse fields of biomedical research [24][25][26] .Supervised and unsupervised methods are being developed to address specific questions and/or data problems as the pace of new data generation increases rapidly.Big data has made the importance of tailored methodologies essential for specialised datasets 27,28 , as speed and accuracy pose an even greater obstacle, especially when handling sizable medical data.The impact of machine learning in biomedical sciences has risen considerably with the multitude of methodologies leading to previously unfeasible computations 29,30 .
Unsupervised learning proved to be an invaluable tool towards exploring heterogeneity in complex diseases since its functioning without any prior knowledge or assumptions of sample labels.Due to this diversity, there is a need for methods that support non-expert users to utilise the characteristics of various methodologies in their unsupervised work.One of the most important aspects of sample partitioning is the stability of the generated groups as unstable clusters, usually imply the lack of signal which should be present and drive the clusters.Signals can take many forms, for example the level of gene activity in RNA sequencing datasets.Cluster instability can be caused inherently by the data points or by the type and application quality of a clustering technique.
Most clustering packages offer implementations of one or more algorithms without considering the previous and subsequent steps of the clustering process (such as detecting useful features or estimating the number of subgroups in the data).Since clustering consists of multiple steps (as shown in the table below) it is important to carefully approach each step so that intermediate decisions are not arbitrarily but based on appropriate clustering theory.A number of clustering methodologies are shown in Table 1 along with the steps of the clustering process they address.With the above obstacles in mind, we introduce Omada, a toolkit with multiple functions based on cluster stability and machine learning formulas to provide assistance to both experienced and inexperienced users during the steps from dataset assessment to the formation of the subgroups.In contrast with other current methodologies, Omada is not simply offering multiple metrics and algorithms to choose from but estimates the ones that provide the most robust results based on the dataset used at the time.Each function's results are based on machine learning theory and multiple metrics to ensure that a wealth of methods will be considered in the decision and clustering process.

Methods
This toolkit consists of a pipeline that takes in a gene expression matrix to identify transcriptomic subgroups of samples (Figure 1 and Supplementary Figure 1).Starting from a matrix of gene expression values (e.g. transcripts per million from RNAsequencing), the most suitable clustering method is chosen, followed by selecting the transcript features for clustering and determining the number of clusters and memberships.Processing and quality control should precede the application of Omada to ensure more accurate inputs.
Step 1 of Omada allows a feasibility analysis to ensure that the input data is suitable for clustering.
Step 2 selects the clustering methodology that provides the most consistent partitions for downstream analysis.In step 3, the genes that best discriminate samples and yield the most stable clusters are determined.
Step 4 estimates the number of clusters (k) through majority voting by internal machine learning indexes.The final output of step 5 provides a cluster assignment to each sample driven solely by its expression profile.Percentage thresholds are recommended values that can be modified.

Sample and gene expression preprocessing
A preprocessing step is recommended by the user before the application of these tools on any dataset to heighten the chances of any underlying important signal to be discovered.Data biases and format can often drive clustering attempts to focus on discriminating data points based solely, or mostly, on known information producing no new insights irrespectively of the method used 21,39 .To address this, it is recommended to attempt to remove/normalise any data points that might be introducing strong biases to allow the novel signal to be detected.Furthermore, numerical data may need to be normalised in order to account for potential misdirecting quantities (i.e.outliers) or specifically transformed to satisfy an algorithm's input criteria.Data points or samples have to be filtered based on field knowledge to allow the data to answer specific scientific questions.Expression data should go through proper quality control depending on the manner of collection to identify outliers and remove unreliable datapoints.For microarrays it is important to assess sample, hybridization and overall signal qualities along with signal comparability and potential biases 40 .Array correlations through PCA and correlation plots should also be considered 40 .RNA sequencing experiments also produce data that need to be controlled for potential trimming of adapter sequences, low quality reads, uncalled based and contaminants by using a plethora of available tools 41,42 .Additionally, qPCR generated data should be checked for abnormal amplification, positive and negative control samples, control on PCR replicate variation and determine reference gene expression stability and deviating sample normalisation factors 43 .As for the number of genes, it is advised for larger genesets (>1000 genes) to filter down to the most variable ones before the application of any function as genes that do not vary across samples do not contribute towards identifying heterogeneity.Moreover, large genesets require increased computational power and extended runtime without adding any real value due to the large number of non useful genes.Lastly, it is important to note that technical artefacts, such as sampling location or machine specifications, may drive clustering causing the formation of very distinct clusters which can solely be attributed to relevant biases.It is very important for those cases to be identified and extracted insights should be disregarded as they do not reflect real signals or data trends.

Determining clustering potential
At the start of each study, we assess the suitability of the input dataset for clustering to ensure general dataset attributes do not influence the process (Figure 1).The number of samples and features (e.g.genes), as well as the balance of the two dimensions, directly affects the capabilities of clustering methods to handle the dataset.An inadequate number of samples does not provide enough training power 44 , while an overabundance of samples might clutter the provided information and confuse most methodologies 45 .Similarly, too few features can lead to weak clustering criteria and too many features might lead a methodology away from the features that can really differentiate between clusters of samples.Therefore, to estimate the feasibility of a clustering procedure on a specifically sized dataset we rely on measurable metrics of cluster quality, such as stability.Clusters of high stability denote both a partitionable dataset as well as a dataset-suitable methodology 46 .The feasibility score of any dataset is a function of both dimensions as well as the number of classes requested.As such, if too many or a single class is requested of a relatively small dataset the calculation will reflect low feasibility due to insufficient samples and/or features to form the desired classes.

Simulating datasets
To assess the quality of the dataset to be used, our toolkit includes two functions for simulating datasets of different dimensionalities for stability assessment.We use those to understand the relation between the number of samples, genes and cluster sizes.The first function simulates datasets allowing for selecting the number of samples (n), genes (m) and clusters (c).Each cluster contains

Intra-method Clustering Agreement
Unsupervised learning offers a multitude of methods to be applied on specific types of data due to their nature (e.g.numeric, binary) or underlying signal to be detected.Most studies employ widelyused methods (e.g.hierarchical clustering) without exercising any kind of selection method that would point towards the most effective methodology.Selecting an appropriate approach requires extensive machine learning and data analysis knowledge coupled with tuning and testing of multiple different algorithms.To enable users without ML expertise to utilise the vast capabilities of this field and avoid limited efficiency of default methodologies, we present a clustering selection tool that offers an intelligent selection method with unbiased results through parameter randomization.The nature of this selection method allows any number of well established unsupervised methods to be considered.
To address the lack of class labels and thus a performance measure in unsupervised models, we compare how consistently different approaches partition our data when one or more parameters change.As high consistency we define the high agreement score calculated between different variations of a clustering algorithm.When two different clustering runs agree on the partitioning of the samples they also show robustness since they do not randomly assign samples to subgroups but rather are driven by the underlying structure of the data.
We implemented a tool (Figure 1) to calculate an average agreement score per clustering approach by comparing a number of runs within each of the three clustering approaches (hierarchical 47 , kmeans 48 , spectral clustering 49 ) using multiple parameters (kernels, measures, algorithms) specifically based on the data set provided.The number of comparisons (c), between runs of the same approach, is an additional overarching parameter of this tool and contributes to the agreement score.For each comparison, the parameters of the two runs are drawn randomly from a predefined set (Table 2) selected randomly with replacement while not allowing the same parameters to be used within one comparison.In the interest of performance and computational time we suggest three comparisons to be used.Depending on c, we generate variations of the base clustering algorithms (package kernlab v0.9-29), along with the various distance measures and clustering categories they belong to.Within each pair of clustering runs the agreement is calculated using the adjusted Rand Index (package fossil v0.3.7), the corrected-for-chance version of the original Rand index 50 , which is based on the number of times any pair of points is partitioned in the same subgroup throughout different clusterings runs.
To calculate the agreement within each clustering algorithm (spectral, k-means, hierarchical) we are considering pairs of runs using the same algorithm but different parameters.For those pairs the agreement is averaged across clustering runs and k number of clusters tested.The algorithm that presents the highest intra-method agreement over a logical range of clusters ( ∈ [2, ]) is noted as the most appropriate clustering of the samples based on a detected signal.A logical range of k is considered a set of successive k's (where k⩾2) that is most probable to exist within our data, often determined by prior knowledge of the data, previous studies or domain expertise.This selection procedure is mainly affected by the type and size of the data leading similar datasets to opt for the same method due to the specific mathematical formulas within each algorithm.

Spectral clustering algorithm 49
Given a set of points S = {s1 ... ,sn} in R l that we want to cluster into k subsets: 1. Form the affinity matrix A∈ Rn✖n defined by Aij = exp(-||si -sj||2 /2σ2) if i ≠ j , and Aii = 0

Feature set subsampling
While gene expression data provide measures on the thousands of transcripts in the transcriptome, not all of them may provide discriminative information on the samples and may not be useful for clustering.Moreover, most clustering algorithms are heavily affected by a large number of features both computationally due to input size and in performance due to misdirecting data noise 51 .A common strategy to select interesting and potentially useful RNA features is to measure their variance across samples and select the ones with the highest scores instead of those that are either housekeeping or do not differentiate in our context.In this tool, we exclude RNA features that remain stable across samples and are therefore unable to offer any discriminatory power to our unsupervised machine learning models.Furthermore, the exhaustive feature selection procedure incrementally considers all the genes in the feature set and takes into account the stability of all generated test clusters and number of cluster ranges.

Stability-based assessment of feature sets
To assess the suitability of each resampled feature set for our clustering, we measure the average stability of the clusters they generate per run when a clustering method is applied over a range of k's (Figure 2B).First, the clustering range, where the stability of each dataset will be calculated, is selected.For each dataset we generate the bootstrap stability for every k within range.To calculate each bootstrap stability score the data is randomly sampled with replacement and clustered internally using a spectral approach.We then compute the Jaccard similarities between the original clusters and the most similar clusters in the resampled data.The above procedure results in a stability score for each k and each dataset.We then calculate the final stability of each dataset by averaging the stability over k.The genes that comprise the dataset with the highest stability are the ones that compose the most appropriate set for the downstream analysis.

Choosing k number of clusters
Most clustering methods require the number of k clusters to be defined as a parameter before the application of the algorithm on the data.The lack of a concrete way to determine the real number of clusters in a dataset led many studies to base their estimation on field/prior knowledge or various estimation methods such as the Silhouette score 52 .However, each method favours different aspects of the generated clusters (i.e.how compact clusters are and how far apart cluster centres are) and therefore suits specific datasets and may introduce bias towards the selection of k.To encompass these different angles in one methodology, avoid the risk of selecting an ineffective index and present a more general solution, this tool uses an ensemble learning approach (Figure 1) where multiple internal cluster indexes contribute to the decision making 53 .This approach prevents any bias from specific metrics and frees the user from making decisions on any specific metric and assumptions on the optimal number of clusters.
Initially, the value of the 15 indexes is calculated for each k within a cluster range of [2, ], where x is a logical upper limit of the number of clusters realistic for our dataset.The means over k are calculated per index and the optimal k is estimated by majority voting of the 14 means that evaluate the compactness and/or the distance between different subgroups.The selection of indexes can be found in Table 3.It is important to note that the most important aspect of determining k is minimum loss of information which directs us to overestimate and not underestimate k 51 while interpreting the voting results.Furthermore, cases that present only a single k as the optimal number of clusters should be treated with caution in case they are a result of a biased dataset.

Optimal parameter tuning
Previous steps have selected the optimal method, number of features and clusters.To perform the optimal clustering we automate the selection of parameters for each method so that manual tuning is not required.Towards that goal we utilise cluster stabilities to decide on the parameters (which depend on the specific algorithm i.e. kernels in k-means and spectral clustering, linkage method in hierarchical clustering) selected by this toolkit.All available parameters (Table 2) participate in the selection procedure where we measure the average bootstrap stability of the clusters (clusterboot function in R package fpc v2.2-3) using the previously determined optimal k and feature set for each parameter.The parameter that produces the highest stability is used for the optimal clustering run.

Test datasets
Five datasets were used to validate different capabilities of the Omada package.First, two datasets were simulated by Omada's functions.Function feasibilityAnalysisDataBased() was used to generate a multi-class dataset with 359 samples and 300 genes based on the contents and dimensions of the original RNAseq data 18  The mRNA expression was in the form of z-scores relative to normal samples where we applied an extra step of arcsine normalisation.This type of transformation calculates a (proportional) pseudocount per gene instead of using one constant overall pseudocount, which creates a compression effect that also accommodates genes with low expression values.After filtering for tissue-specific genes 69 for the three cancer types we retained 243 genes.Next, we utilised a pulmonary arterial hypertension (PAH) dataset (25,955 genes) generated from 359 patient samples with idiopathic (IPAH) or heritable (HPAH) nature.The transcriptomic data can be found in the EGA (the European Genome-phenome Archive) database under accession code EGAS0000100553265 70 (restricted access) and all pre-processing details and parameters used can be found in 18 .Finally, we used an RNA dataset from the whole blood of 238 mothers during midgestation (26-28 weeks of pregnancy), where whole transcriptome library was constructed based on Illumina's standard protocol and quantified using real-time PCR.Sequencing was based on Illumina HiSeq 4000 system.Read counts were extracted from GEO (accession number GSE182409 71 ) and were then read into R and converted into TPM using the convertCounts function available in the DGEobj.utilspackage.For the purpose of clustering, we mapped the TPM dataset to the list of 24,070 genes used in the PAH dataset described in a previous section.

Results
Omada was applied to five diverse gene expression datasets to demonstrate its utility in guiding cluster analysis and identifying plausible subgroups of samples.Two datasets were simulated by our tools.The simulated dataset with multiple distinct classes was used to determine Omada's ability to accurately estimate k with reasonable stability when we know the existence of sample classes.In contrast, samples in the single-class simulated dataset were drawn from a single class and used to demonstrate the toolkit's ability to point towards the lack of sample subgroups by indicating inconclusive low scores throughout the analysis.A multi-tissue PANCAN dataset was introduced to assess Omada's capability to generate signal-based clusters that closely follow the tissue-specific patient sample distributions.In addition, to determine whether Omada can identify distinct heterogeneous subgroups from data without any prior classification information but potential present heterogeneity, we used a whole blood RNAseq dataset from patients with pulmonary arterial hypertension (PAH) 18 .Lastly, implementation of the toolkit on a whole blood expression dataset collected from women during pregnancy (GUSTO) was included to demonstrate a case with potential technical biases and no known subgroups since it is composed of healthy participants.
For the above, we measured its consistency on algorithm, feature and number of clusters (k) selection and the stability of the generated clusters for a particular k (stabilityk) and the average across k's (stabilityavg).It's important to note that the value of this validation is derived from the fact that unstable clusters should not be interpreted as this instability comes from problematic data or an incorrect approach.However, cluster stability only provides a mechanistic way to assess the underlying data structure and further information is required to fully biologically validate the clusters 46 .

Single-class dataset: Homogeneously simulated dataset
To demonstrate Omada's ability to identify datasets without any present clusters where all patients belong to one class, we used the single-class simulated dataset (see Test datasets in Methods).All potential k of two or higher achieved low scores with average and maximum stabilities of 45% and 55%, respectively (Supplementary Table 1).It is recommended to avoid clustering analysis on such low score datasets and instead opt for scores of at least 60%.Ideally, stabilities of 80-90% are considered very strong 72 , however the potential of several signals in transcriptomic data and the exploration across multiple k generally decreases the output stability to an acceptable threshold of 60-70%.Next, Figure 3A shows the overall low partitional consistencies (averaged over all tested k) for all algorithms with spectral average partition agreement of 52%, kmeans average partition agreement of 3% and hierarchical average partition agreement of 26%.With the best performing algorithm showing an agreement of around 50% we can assume that the tested algorithms are randomly assigning memberships, therefore we cannot achieve a robust model with the current data.When using spectral clustering to select the most appropriate set of genes, the cluster stability rapidly dropped below 50% when using more than 20 genes (Figure 3B) indicating that the algorithm got worse in assigning memberships as we considered more simulated genes.Finally, the ensemble voting step showed the majority of the votes supporting five clusters (Figure 3C), a significant variation from the single simulated class of this dataset.In such unexpected outputs, one should examine the generated metric scores (Table 3; Supplementary Table 3).Multi-class dataset: Five distinct simulated expression classes representing heterogeneity Omada's basic function is to help identify samples that come from different sources and group together samples that come from the same source.Towards that end, we simulated a dataset with five sets of expression profiles with 50 samples each and 120 genes sourced from five unique distributions of expression data that represent heterogeneity within our samples.The means and standard deviations of each class are presented in Figure 4A, depicting the expression differences.
Additionally, the empirical Cumulative Distribution Functions (ECDFs) of the five simulated classes (Figure 4B) as well as the high average Kolmogorov-Smirnov distances (Davg=88.3%,Supplementary Table 2) show distinct differences between the distributions in respect to the expression in the simulated RNAseq dataset.To demonstrate the effect of different sample and gene numbers, multiple datasets were simulated with an increasing number of samples and genes (Figure 4C).The calculated cluster stabilities, where each value represents the stability over a range of k and a specific number of samples and features, show five or less samples per class provide highly unstable and unreliable clusters.The minimum acceptable stability threshold of 60% was achieved with at least 20 samples and a reliable stability of 75% was achieved using 1000 samples.The majority of metric scores are worse in testing single-class instead of multi-class simulations (Supplementary Table 3), especially in lower compactness and shorter distance between clusters.

Figure 4: A) Expression boxplots for the five clusters showing the means and standard deviations B) The cumulative probability (as calculated from the empirical cumulative distribution function) for the five clusters calculated by a two sided Kolmogorov-Smirnov Test C) Average over-k stabilities for simulated datasets of increasing sample and gene numbers. A small number of samples consistently provides extremely unstable clusters (orange) while increasing both numbers consistently produces datasets that pass the stability threshold of 60% (blue).
To test the ability of the clustering tools to produce stable clusters in various contexts we first apply them in sequence on strategically simulated data.The data are composed of distinct classes (based on class mean mclass and standard deviation sdclass) and due to that strong signal our tools are expected to determine an accurate k with reasonable stability, scoring above 60%.To allow for a more direct comparison, we used a multi-class simulated dataset from random sampling of real RNAseq data 18 .
When considering ranges of k we are using [2, 6] clusters to observe a broader range of results for comparison reasons.The clustering feasibility tool showed that the highest stability was 78% (Supplementary Table 4), providing a strong indication of stability across our clusters.Since we selected a limited range of  ∈ [2,6] where the stability should remain high, the averaged-over-every tested -stability (stabilityavg) of 72% indicates a dataset of adequate size and class definition to proceed to clustering analysis.It should be noted that when large ranges of k are selected the average stability will naturally decrease as the calculations will take into account k's much larger or smaller than the actual number of classes in the data.In such cases the user can review the individual k stabilities generated as part of this tool to conclude whether those values are satisfying i.e a minimum of 60%.Next, we calculated the partitioning agreement of three clustering algorithms and spectral clustering showed the highest average score of 56% (Figure 5A and Supplementary Table 4).
Partitioning agreement scores should be interpreted across algorithms applied on the same dataset rather than as absolute values keeping in mind that a score below 50% represents a random partitioning and subsequently a non-robust clustering.In the subsequent feature selection step, the highest average stability was registered when using all 300 features (stabilityavg = 78%, Supplementary Table 4), not discarding any feature as they all demonstrated very similar variance due to the nature of the simulated data.Finally, 8 out of 15 internal metrics voted five clusters as the optimal number during the k estimation step (Figure 5B) providing a confident estimation above 50%.

Discriminating cancer types from pan cancer tissue expression data
An integral capability of Omada is to accurately stratify patients according to any biologically relevant signal present in expression data and detect differences stemming from genes, pathways, tissues etc.Real multi-tissue samples are often the focus of exploratory studies as they present celltype differences but still unknown factors that may discriminate them.Using expression data from multiple cancer types (Pan-cancer dataset as described in Test datasets in Methods), we expect our tools to identify clusters that are consistent with the samples' tissues of origin.Due to the different types of tumours we explored the potential cluster range of [2, 5] for each pipeline step.The clustering feasibility of the dataset (2244 samples, 243 genes) presented an average stability of 88% and maximum stability of 100% (Supplementary Table 5) providing confidence for the downstream analysis.Spectral clustering showed the highest consistency (partition agreementavg = 63% closely resembling the simulated multiclass dataset, Figure 5A) and was therefore deemed as the most robust.
In this example hierarchical clustering showed high instability, as shown in Figure 5A, demonstrating the importance of selecting the appropriate algorithm to create a robust model.According to our selection tool, all 243 genes produced the most stable set of clusters with a stability of 96% (Supplementary Table 5) which coupled with the high algorithm robustness indicated a model that most likely detects a signal in the data.Additionally, a very important observation is that all genes were deemed important to produce nearly perfectly stable clusters agreeing with the filtering of genes based on the cancer type annotations we performed prior to this clustering analysis.The ensemble voting tool estimated our dataset to contain three clusters of samples with the support of 57% of the metrics (Figure 5B).When comparing these results with the simulated five-class dataset, both achieved higher certainty on the five clusters (>50%, Figure 5B) reflecting the rigid differences between the clusters when dealing with cancer tissues and simulated classes.In the case of the PANCAN partitioning, the breast, lung and colon/rectal samples almost perfectly grouped in their respective clusters (Figure 5C).

RNAseq data from diseased tissue with unknown heterogeneity
It is important for Omada to be able to robustly identify patient subgroups when heterogeneity for the cohort has not been previously characterised.We applied our tools on such a dataset (PAH dataset as described in Test datasets in Methods) to assess whether they can still produce stable clusters that differ in terms of expression profiles and other phenotypic measures.The feasibility for this dataset's simulation showed an average stability of 61% and a maximum stability of 74% both acceptable to proceed with the clustering analysis (Supplementary Table 6).A notable 13% difference between average and maximum stability provides a positive indication that a specific k might prove significantly more stable downstream.The spectral clustering technique recorded the highest partitional consistency (partition agreementavg = 86% and partition agreementmax = 96%) when we examined each algorithm's partition agreement for up to ten clusters.The bootstrapping subset selection tool estimated the 300 most variable genes as the most stable clustering parameter with a maximum stability of 73% (Figure 6A) showing an impressive reduction from the initial gene set (25,955) and ensuring the removal of a lot of data noise.According to the ensemble voting tool two clusters were voted by 71% of the internal metrics followed by k = 3 (14%) and k = 5 (7%).Despite the strong indication of two clusters, k = 5 was selected to prevent loss of information occuring when smaller embedded clusters are disregarded.As shown by the downstream analysis, fully presented in 18 , selecting the higher k, even as a second estimate, allowed us to detect strong expression profiles.
After considering cluster sizes the three predominant subgroups showed significant differences in expression, immunity and survival profiles as well as risk category distributions (Figure 6B, C).

RNAseq data from healthy whole blood tissue
Next we tested how Omada would discriminate samples from healthy individuals from a single tissue type.Generally, in studies based on a dataset with no discernible heterogeneity to be explored -i.e a dataset without patients of dissimilar outcomes or controls -clustering algorithms may not be robust and may generate variable results.Useful partitioning might still be formed, such as unforeseen disease subgroups, but these observations must be validated.Towards that end, we used the GUSTO RNA whole blood dataset of 238 mothers obtained during mid-gestation, as seen in Test datasets in Methods.During determining clustering potential our simulated dataset showed stabilityavg = 56% and stabilitymax = 59% (Supplementary Table 7), a similar low-stability score as in the simulated single-class (45% and 55%).We examined a k-range of [2, 5] where spectral and k-means clustering showed very similar internal average partitional agreements of 61% and 60% and very high maximum agreements of 93% and 88%, respectively.The extremely high agreement scores should be interpreted with caution as they might not reflect a very strong signal but an underlying bias that partitions samples in similar groups repeatedly, over-powering the parameter changes.The 50 most variable genes were estimated to produce the most stable clustering with maximum stability = 71% (Figure 6A).Similarly to the agreement scores, a small number of genes driving the most stable clusters (starting from 24,070 genes) might indicate either a strong expression signal or a pre-existing bias.When estimating the number of clusters, two (46%) and three (40%) clusters were voted by the majority showing a general consensus.Considering all the above strong indications, we need to assess the dataset and the resulting subgroups for potential biases before relying on the cluster memberships.
Towards that end and utilising clinical data, the association results show the dataset might be biased based on technical batches with sequencer machine (Figure 6E) and flowID (Figure 6F) presenting significant differences between clusters (1.39e-03 and 2.55e-06, respectively) with hospital location coming close to significance with p-value = 0.072 (Supplementary Table 8).Additional statistical tests and regression analysis with maternal and foetal physiological and clinical phenotypes did not show any association with the clusters.The expression profiles of the two clusters show visible differences (Figure 6D) as do the t-SNE and PCA analyses (Figure 6H) with the first principal component of the latter explaining 79% of the variance in the GUSTO dataset.

Discussion
Our toolkit is designed to answer multiple questions arising during transcriptomic exploratory studies that target to uncover heterogeneity and subtypes within any condition that might be driven by expression changes.With the plurality of unsupervised methods available and their specialised nature, the selection of the most appropriate approach is a multi-factor problem.A lot of technical decisions are required in the procedure starting with a dataset and completed with a meaningful set of subgroups.To assist with this problem, our toolkit initially assesses the potential of a target dataset and provides estimates of the most appropriate method, gene set and number of subgroups finally outputting a partition based on optimised parameters unburdening the user of specialised decision making (Figure 1).Applying unsupervised learning on expression datasets is often not a straightforward task as it contains the element of uncertainty mainly introduced by the lack of knowledge on the data points.
No methods or metrics can give a definitive answer to the main clustering questions, as presented in previous sections, therefore each tool has to be used with caution, i.e. determining the dataset clustering potential is an indication rather than a clear sign that partitioning the dataset will yield informational subgroups.Additionally, clustering can often contain non-deterministic steps allowing for each function to behave slightly differently between similar runs.To reduce the uncertainty and provide a reliable set of tools, this toolkit has been applied on various gene expression datasets where its efficiency has been demonstrated.However, it is important to note that despite the use of multiple methodological approaches within this toolkit the inherent exploratory characteristics of clustering do not allow for clusters of definite value, instead they are meant to be dealt with scientific caution and biological validation.Aside from the actual memberships, the functions in this package can also reveal useful information about the input data.The GUSTO RNAseq dataset showed that biases can be discovered by applying simple tests, such as PCA or tSNE, in conjunction with the cluster members.It is also possible for Omada to hint towards the existence of a single class, and therefore no heterogeneity, by consistently revealing low partition agreement and stability scores across multiple functions, as demonstrated in our single-class dataset example.Furthermore, Omada can help in selecting a small group of genes with potential partitioning capabilities as the feature selection step is expected to greatly reduce the number of genes which in most cases count to thousands.This toolkit is currently based on specific clustering techniques and metrics but its modular nature allows its extension to accommodate different data types that come in the form of continuous numeric values such as microRNA, metabolite or single cell RNA datasets.Furthermore, the structure of this toolkit allows for additional approaches to be added in the future to the pool of clustering algorithms to be tested keeping up with the current state of the art techniques.
special series or article collection?Experimental design and statistics Full details of the experimental design and statistical methods used should be given in the Methods section, as detailed in our Minimum Standards Reporting Checklist.Information essential to interpreting the data presented should be made available in the figure legends.Have you included all the information requested in your manuscript?Yes Resources A description of all resources used, including antibodies, cell lines, animals and software tools, with enough information to allow them to be uniquely identified, should be included in the Methods section.Authors are strongly encouraged to cite Research Resource Identifiers (RRIDs) for antibodies, model organisms and tools, where possible.Have you included the information requested as detailed in our Minimum Standards Reporting Checklist?Yes Availability of data and materials All datasets and code on which the conclusions of the paper rely must be either included in your submission or deposited in publicly available repositories (where available and ethically appropriate), referencing such data using a unique identifier in the references and in the "Availability of Data and Materials" section of your manuscript.Have you have met the above requirement as detailed in our Minimum Standards Reporting Checklist?Yes Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation

Figure 1 :
Figure 1: An overview of steps for discovering gene expression subgroups using Omada's clustering tools.Processing and quality control should precede the application of Omada to ensure more accurate inputs.Step 1 of Omada allows a feasibility analysis to ensure that the input data is suitable for clustering.Step 2 selects the clustering methodology that provides the most consistent partitions for downstream analysis.In step 3, the genes that best discriminate samples and yield the most stable clusters are determined.Step 4 estimates the number of clusters (k) through majority voting by internal machine learning indexes.The final output of step 5 provides a cluster assignment to each sample driven solely by its expression profile.Percentage thresholds are recommended values that can be modified.

Figure 2 :
Figure 2: Sample selection overview.(A) Ranking of samples based on their variance across features and the subsequent generation of datasets of increasing size.(B) Calculation of the stability score of each generated dataset.Initially, we select a cluster range to run our clustering method for each dataset.After the clustering procedure, we calculate and average the stability over the generated clusters.Finally, we average the stabilities over k per dataset and determine a final stability score for each dataset.The features of the dataset with the highest stability are the ones that compose the most appropriate set for the downstream pipeline For the formulas: k = number of clusters, n = number of data points, E T = sum of the distances of all the points to the barycenter G of the entire dataset, E W = sum of the distances of the points of each cluster to their barycenter, NB = pairs constituted of points which do not belong to the same cluster, NB = pairs constituted of points which belong to the same cluster, NT = NW + NB , SW = sum of the NW distances between all the pairs of points inside each cluster, SMIN = sum of the NW smallest distances between all the pairs of points in the entire data set, SMAX = sum of the NW largest distances between all the pairs of points in the entire data set, SB = sum of the between-cluster distances, α = weight equal to the value of average scattering of clusters obtained for the partition with the greatest number of clusters.

Figure 3 :
Figure 3: Performance criteria for single-class simulated dataset.The results demonstrate low scores for the majority of steps.A) shows the average partition agreement of all three algorithms below the 52% mark indicating very unstable clustering runs overall.Utilising spectral clustering (highest partition agreement) for all downstream steps, in B) we observe that the stability of every possible subset of genes does not surpass 51.3% underlying overall unstable clusters.C) shows five clusters as the first estimate (voted by 8 metrics), significantly different from the one class this dataset contains (in this case 14 scores were used as certain metrics are (infrequently) unable to produce a score and are therefore omitted from voting).

Figure 5 :
Figure 5: Performance criteria for two heterogeneous datasets, simulated multi-class and PANCAN dataset.The multi-class dataset contains artificial samples from five distinct clusters and the PANCAN dataset is composed of three different cancer types presenting biological heterogeneity.A) The agreement between the predicted and true clusters (Adjusted Rand Index) from three different clustering algorithms (HC: hierarchical, KM: K-means, SC: spectral clustering) applied to the two datasets.B) Shows the real number of clusters for the dataset (black text) and the two most likely number of clusters k, with estimates of their percent probability.C) The contingency tables of the combinations between generated clusters (1st estimates) and real classes in the datasets.Darker red colour intensity denotes higher frequency.

Figure 6 :
Figure 6: Performance criteria for PAH and GUSTO datasets which have no known subgroups.Panel A) depicts the average and max sample set stabilities (percentage) for both datasets.The red dashed line represents the threshold of a stable clustering (60%).The PAH RNAseq dataset contains expression of IPAH patients with panel B) showing the gene expression heatmap and C) survival profiles for discovered subgroups (subgroups 3 and 4 are omitted as they were not used in downstream analysis due to their size).The GUSTO dataset contains expression from healthy maternal whole blood with panel D) showing the gene expression heatmap.The following panels show the distribution of cluster members across E) sequencer machines(chi-square p-value 1.39e-03), F) flow IDs (chi-square p-value 2.55e-06) and G) hospitals where the data were collected (chi-square p-value 0.072).H) t-SNE and PCA plots of the expression profiles with labelling of the two discovered subgroups.

Figure 2 Figure 3 Figure 5
Figure 2 Click here to access/download;Figure;Figure 2.pdf

Figure 6
Figure 6 Click here to access/download;Figure;Figure 6.pdf

Table 1 :
Comparison of clustering methodologies towards identifying heterogeneous patient groups.Five steps of the clustering process are reflected in the columns below.Feasibility analysis assesses the potential of the dataset to be utilised efficiently by clustering methods based on its dimensions.The clustering algorithm denotes the amount of algorithms available in the methodology.Feature selection allows the selection of the features that best discriminate samples during clustering.. Cluster quality describes the methods/metrics used to validate the quality of the clusters produced.Number of clusters describes the approach a methodology follows to estimate the most probable number of different groups in the dataset.
To assess the clustering feasibility of an existing dataset this tool kit also provides a similar function which generates a simulated dataset based on an input dataset and the user's estimation of the number of classes.The number of samples and genes equal those of the input dataset and its mean (minput) and standard deviation (sdinput) affect those of each generated class within the dataset.Specifically, if  ∈ (1,2,3, . . . ) is the number of classes, each class mean (mclass) equals   * 10 *  and each class standard deviation (sdclass) equals   * 2 * .

Table 2 |
The clustering algorithms, their approach category and the various distance measures tested This step does not require any deep knowledge or filtering decisions by the user.Based on this observation our sample selection step, which is a part of the tool for bootstrap resampling of features presented in Figures1 and 2A, first ranks features in a descending order of variance (var function from the Stats R package) across samples, generating a list of the most variable features.Subsequently, multiple datasets of all samples and subsets of features are generated.All subsets draw a different number of features from the top of the variance list with replacement.The first dataset uses a relatively small number of features (n), depending on the total number of features (N) and the granularity of the result desired.The following datasets re-draw from the initial list increasing the number of features by n, ending up with

Table 3 :
The list of 15 internal indexes used to estimate the optimal number of clusters (k).All indexes are using different formulas to score a partitioning, measuring one or both of the following concepts: a) how compact each cluster is and b) how well the clusters separate.For each index we present which value is preferred (min or max) and its source.
All individual tools are computed internally and therefore do not require prior deep knowledge of machine learning by the user.All results, intermediate and final, are observable and each step is justified by multiple measures and indices representing widely used machine learning techniques.